Tokenization

Introduction

Tokenization is the very first step before any numbers enter a language model: it converts raw text into subwords, the building blocks of words. For example, the word tokenization can be decomposed into the subwords toke, niza, and tion, as seen in the image at the top of the page, so tokenization ≈ toke + niza + tion.

Why not just use characters or words?

Characters are too granular — you’d need thousands of steps to process a sentence, and the model would have to learn spelling from scratch. Whole words are too coarse — English has millions of word forms, and the model would have no way to handle anything it hasn’t seen before. Subword tokenization is the middle ground: common words get one token, rare words are decomposed into familiar pieces.
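One common way to apply a subword vocabulary is greedy longest-match: at each position, take the longest vocabulary entry that fits. The sketch below uses an invented toy vocabulary (real tokenizers learn theirs from a training corpus) and assumes lowercase input, but it shows the middle ground in action: a common word stays whole, a rarer word breaks into familiar pieces.

```python
import string

# Invented toy vocabulary: a few subwords plus single-letter fallback.
# A real vocabulary would be learned from data and be far larger.
VOCAB = {"the", "ing", "toke", "niza", "tion"} | set(string.ascii_lowercase)

def tokenize(word):
    """Greedy longest-match: at each position take the longest vocab entry."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest substring first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

print(tokenize("the"))           # ['the'] — a common word is one token
print(tokenize("tokenization"))  # ['toke', 'niza', 'tion']
```

Because every single letter is in the vocabulary, the loop can always fall back to one-character tokens, so no input (of lowercase letters) is ever unrepresentable.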

How BPE works

Byte-Pair Encoding (BPE) is the most popular tokenization technique today. It starts with a vocabulary of individual bytes or characters, then repeatedly scans the training corpus for the most frequent adjacent pair of tokens and merges it into a new token. This merge step runs tens of thousands of times, until the vocabulary reaches its target size. The result is a vocabulary where frequent words and subwords are compressed into single tokens, while rare combinations fall back to shorter pieces.
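The training loop above can be sketched in a few lines. This is a minimal illustration on a tiny made-up corpus, not a production implementation (real BPE also handles word boundaries, byte-level input, and frequency weighting):

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    words = [list(w) for w in corpus]  # start from individual characters
    merges = []
    for _ in range(num_merges):
        # count every adjacent pair of symbols across the corpus
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # apply the merge everywhere it occurs
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges, words

corpus = ["low", "low", "lower", "lowest", "newest", "newest"]
merges, segmented = bpe_train(corpus, 4)
print(merges)  # first merges fuse 'l'+'o', then 'lo'+'w', and so on
```

After a few merges, frequent fragments like "low" become single symbols while rare suffixes remain split, which is exactly the compression behavior described above.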

Some Points to Note

Misspellings

The model never sees the original text: it is permanently compressed by the tokenizer before it enters the model. A single typo like teh instead of the produces a completely different token (or multiple tokens), which is part of why LLMs are more sensitive to misspellings than humans are.
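A greedy longest-match sketch over an invented toy vocabulary makes the effect concrete: "the" has its own vocabulary entry, "teh" does not, so the typo shatters into unrelated single-letter tokens.

```python
import string

# Invented toy vocabulary: "the" is a single token, "teh" has no entry.
VOCAB = {"the"} | set(string.ascii_lowercase)

def tokenize(word):
    """Greedy longest-match over the toy vocabulary (lowercase input)."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

print(tokenize("the"))  # ['the'] — one token
print(tokenize("teh"))  # ['t', 'e', 'h'] — three unrelated tokens
```

From the model's point of view, those three tokens share nothing with the token for "the"; any similarity has to be learned statistically.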

Math

Numbers get split into small chunks, so arithmetic requires the model to carry meaning across multiple tokens — one reason models historically struggle with math.
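To see why chunked numbers make arithmetic awkward, consider a hypothetical tokenizer that groups digits left to right in runs of up to three (real tokenizers differ in the exact grouping, and some group right to left for this very reason):

```python
def chunk_digits(number, size=3):
    """Hypothetical left-to-right digit chunking, up to `size` digits per token."""
    s = str(number)
    return [s[i:i + size] for i in range(0, len(s), size)]

print(chunk_digits(999))   # ['999']
print(chunk_digits(1000))  # ['100', '0']
print(chunk_digits(1234))  # ['123', '4']
```

Computing 999 + 1 now requires relating the units digit inside the token '999' to the units digit in a separate token '0': the place values are not aligned across tokens, so carries have to propagate across token boundaries.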

Non-English

Non-English languages tend to use 2–4× more tokens per word than English, since the vocabulary was built primarily from English text. Try the Non-English example in the demo above to see what byte-level fallback looks like, or click any preset to explore how different text types tokenize very differently.
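The byte-level fallback is easy to quantify: text the vocabulary does not cover decomposes into raw UTF-8 bytes, and non-Latin scripts need several bytes per character. Comparing an English greeting with a Japanese one of the same character length:

```python
english = "hello"
japanese = "こんにちは"  # "hello" in Japanese, also five characters

# Each ASCII letter is 1 byte in UTF-8; each hiragana character is 3 bytes.
print(len(english.encode("utf-8")))   # 5
print(len(japanese.encode("utf-8")))  # 15
```

Starting from three times as many byte-level pieces per character, and with far fewer learned merges covering the script, the token count per word climbs quickly, matching the 2–4× ratio mentioned above.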